https://github.com/psboonstra/ModelSelection699

Does this model spark joy?

Definition: model selection is “estimating the performance of different models in order to choose the best one” (Hastie, Tibshirani, and Friedman 2009)

What it might include

  • Choosing structure (GLM, nonlinear model, generalized additive model, discrete/continuous time model, non-likelihood-based [tree, other machine learning methods])
  • What variables to include
  • How to parametrize (if necessary)

Theory versus practice

Process in BIOS 650, 651, 653, 675, etc. is usually

  1. fit model
  2. report results

Process in many data analyses is

  1. explore data
  2. make preliminary decisions
  3. fit initial models
  4. fit final model
  5. report results
  6. assess performance

Key idea of this lecture: steps 1–3 impact steps 5 and 6

Simulation study to fix ideas

  • \(X \overset{iid}{\sim} MVN(0, \Sigma(\rho))\); \(\epsilon\overset{iid}{\sim} N(0, 1)\)

  • \(Y = X\beta + \sigma\epsilon\)

  • length of \(\beta\) is \(p\)

  • \(n\) observations from model

  • \(R^2 = \mathrm{var}(X\beta)/[\mathrm{var}(X\beta) + \sigma^2]\)

  • Compare two approaches:

      • No variable selection: fit a multiple regression with all variables

      • Variable selection: exclude all variables with univariable p-value \(> 0.10\), then fit a multiple regression with the selected variables

Simulation study, cont’d

library(tidyverse); library(broom);
set.seed(1);
# number of simulations
n_sim <- 2e3;
# subjects per simulation
n_subj <- 100;
# compound-symmetric correlation
rho <- 0.10;
# true generating betas: one non-zero, p_null zero
p_null <- 19;
beta <- c(1, numeric(p_null));
# true_rsq implies the value of sigma
true_rsq <- 0.2;
# drop all variables with univariable p-value exceeding this threshold
p.value_threshold <- 0.10;
source("varselect_sim.R");
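The sourced script is not shown on the slides; below is a minimal sketch of what varselect_sim.R might do. The defaults mirror the parameter values above, base R stands in for broom::tidy so the sketch is self-contained, and the names full_model_coefs / selected_model_coefs match the slides that follow, but the implementation details are guesses.

```r
# Sketch only: coefficient table (term, estimate, p.value) per model
tidy_coefs <- function(fit, sim_id) {
  co <- summary(fit)$coefficients
  data.frame(term = rownames(co), estimate = co[, 1],
             p.value = co[, 4], sim_id = sim_id, row.names = NULL)
}

# One simulated dataset, analyzed both ways
run_one_sim <- function(sim_id, n_subj = 100, rho = 0.10,
                        beta = c(1, numeric(19)), true_rsq = 0.2,
                        p.value_threshold = 0.10) {
  p <- length(beta)
  Sigma <- matrix(rho, p, p); diag(Sigma) <- 1        # compound symmetry
  x <- matrix(rnorm(n_subj * p), n_subj, p) %*% chol(Sigma)
  colnames(x) <- paste0("x.", seq_len(p))
  sigma <- sqrt(drop(t(beta) %*% Sigma %*% beta) *
                  (1 - true_rsq) / true_rsq)          # implied by true_rsq
  dat <- data.frame(y = drop(x %*% beta) + sigma * rnorm(n_subj), x)
  # approach 1: no variable selection
  full <- tidy_coefs(lm(y ~ ., data = dat), sim_id)
  # approach 2: univariable screen, then multiple regression
  univ_p <- sapply(colnames(x), function(v)
    summary(lm(reformulate(v, "y"), data = dat))$coefficients[2, 4])
  keep <- names(univ_p)[univ_p < p.value_threshold]
  selected <- tidy_coefs(lm(reformulate(c("1", keep), "y"), data = dat),
                         sim_id)
  list(full = full, selected = selected)
}

set.seed(1)
sims <- lapply(seq_len(20), run_one_sim)   # n_sim = 2e3 on the slides
full_model_coefs <- do.call(rbind, lapply(sims, `[[`, "full"))
selected_model_coefs <- do.call(rbind, lapply(sims, `[[`, "selected"))
```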

Distribution of test statistics (null parameters)

No variable selection

Distribution of test statistics (null parameters)

Variable selection

Number of false positives (using \(p<0.05\))

No variable selection

full_model_coefs %>% 
  filter(term != "x.1") %>%
  group_by(sim_id) %>%
  summarize(num_fd = sum(p.value < 0.05)) %>%
  summarize(mean_num_fd = mean(num_fd), 
            median_num_fd = median(num_fd));
## # A tibble: 1 × 2
##   mean_num_fd median_num_fd
##         <dbl>         <dbl>
## 1       0.904             1
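As a sanity check (my arithmetic, not from the slides): each of the 19 null coefficients is tested at level 0.05, so

\[ \mathrm{E}[\#\text{false positives}] = 19 \times 0.05 = 0.95, \]

close to the simulated mean of 0.904.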

Number of false positives (using \(p<0.05\))

Variable selection

selected_model_coefs %>% 
  filter(term != "x.1") %>%
  group_by(sim_id) %>%
  summarize(num_fd = sum(p.value < 0.05)) %>%
  summarize(mean_num_fd = mean(num_fd), 
            median_num_fd = median(num_fd));
## # A tibble: 1 × 2
##   mean_num_fd median_num_fd
##         <dbl>         <dbl>
## 1       0.518             0

Estimated \(R^2\)

No variable selection

Estimated \(R^2\)

Variable selection

Comparison

  • The no-selection approach estimated the regression coefficients without bias but may be overfitting the data

  • The selection approach estimated the regression coefficients with bias but with lower variance and fewer false positives

Careless modeling

Some consequences:

  • biased coefficient estimates

  • non-nominal significance

  • false positive findings

  • false certainty

  • tendency to overstate proportion of variance explained

Variable selection entails a ‘bias–variance’ trade-off

What can be done

  1. Scientific context should be incorporated wherever possible:
     • what is my primary question?

     • do prior models exist? how can I incorporate that information?

     • are there known constraints? e.g. design features, monotonicity, positivity

  2. Data can be used to suggest models:
     • when used for hypothesis generation, this is called ‘exploratory data analysis’ (Tukey 1980)

     • SAS / R / STATA have automatic, data-dependent selection routines built in

     • when conducted uncritically, ‘data dredging’ or ‘data torture’ are better descriptors

Topics we’ll be covering

This lecture:

  • AIC

  • forward / backward selection

Future lecture(s):

  • Cross-validation

  • LASSO

Model size (including \(\beta_0\), \(\sigma\))

Variable selection

Maximized likelihood is biased upward relative to its expectation

  • Ideal setting: training data to fit model and independent validation data to select model.

  • Suppose we have independent observations of outcome given same covariates: \(Y_\mathrm{new} = X\beta + \sigma\epsilon_\mathrm{new}\)

  • Akaike (1973) asymptotically linked expected log-likelihood of \(Y_\mathrm{new}\) to observed likelihood of \(Y\)

  • Let \(\{\hat\beta_0, \hat\beta,\hat\sigma\}\) be estimated parameters after model building. Then

\[ \begin{aligned} \mathrm{LogLik}_\mathrm{new} &\approx \mathrm{Loglik} - \mathrm{Model~Size}\\ \Rightarrow \log f(Y_\mathrm{new}|\hat\beta_0, \hat\beta,\hat\sigma) &\approx \log f(Y|\hat\beta_0, \hat\beta,\hat\sigma) - \#\{\hat\beta\neq0\} - 2 \end{aligned} \]

\(\mathrm{LogLik}_\mathrm{new}- \mathrm{Loglik} + \mathrm{Model~Size}\)

No variable selection

(small dot indicates mean value)

\(\mathrm{LogLik}_\mathrm{new}- \mathrm{Loglik} + \mathrm{Model~Size}\)

Variable selection

(small dot indicates mean value)

Comparison of observed vs. adjusted likelihoods

Akaike’s Information Criterion (AIC)

  • AIC is defined as \(-2\times\mathrm{Loglik} + 2\times\mathrm{Model~Size}\) (the \(-2\) multiplier is for “historical reasons”)

It is…

  • a metric for comparing the relative distance of various fitted models to the true generating model (smaller is better)

  • an approximation of the expected log-likelihood of your fitted model when the sample size is large relative to the number of predictors

  • more than just the intuition that subtracting a penalty from model fit is a reasonable thing to do

Akaike’s Information Criterion (AIC)

  • AIC is defined as \(-2\times\mathrm{Loglik} + 2\times\mathrm{Model~Size}\) (the \(-2\) multiplier is for “historical reasons”)

It is not

  • an absolute measure of model fit

  • a good approximation of its estimand when the sample size is small relative to the number of predictors

  • useful for comparing models fit to non-identical sets of observations

  • useful for comparing models for different transformations of the outcome
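To illustrate the non-identical-observations point with a built-in dataset (my example, not from the slides):

```r
# airquality has missing values in Solar.R, so adding that predictor
# silently drops rows: the two fits use different observations, and
# their AICs are therefore not comparable
fit1 <- lm(Temp ~ Wind, data = airquality)
fit2 <- lm(Temp ~ Wind + Solar.R, data = airquality)
c(nobs(fit1), nobs(fit2))  # different sample sizes
```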

AIC and REML

  • In general, AIC requires likelihoods given ML-estimated parameters, not REML-estimated (see Section 4.7.2, Gałecki and Burzykowski (2013))

  • Possible to compare models using ML-estimates, then report REML-estimates from selected model

  • See also https://tinyurl.com/yagpx4bv (StackExchange:CrossValidated)

Small-sample AIC

  • Hurvich and Tsai (1989) proposed a faster-converging “corrected” AIC with an adjusted estimate of model size:

\[ \begin{aligned} \mathrm{AIC}_C = -2\times\mathrm{Loglik} + 2\times\mathrm{Model~Size}\times\left(\dfrac{n}{n - \mathrm{Model~Size} -1}\right) \end{aligned} \]

  • To account for \(p \gg n\), Boonstra, Mukherjee, and Taylor (2015) suggest instead positive part of correction: \(\left(\dfrac{n}{n - \mathrm{Model~Size} -1}\right)_+\)

  • Phil’s opinion: always use \(\mathrm{AIC}_C\)
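As a sketch in R (the function name aicc is mine, not from the slides; it takes the model size from the df attribute of logLik()):

```r
# Sketch: small-sample corrected AIC, with the positive-part correction
# factor suggested by Boonstra, Mukherjee, and Taylor (2015)
aicc <- function(fit) {
  ll <- logLik(fit)
  k <- attr(ll, "df")              # model size (for lm: coefficients + sigma)
  n <- nobs(fit)
  corr <- max(n / (n - k - 1), 0)  # positive part guards against tiny n
  -2 * as.numeric(ll) + 2 * k * corr
}

fit <- lm(mpg ~ wt + hp, data = mtcars)
aicc(fit)  # exceeds AIC(fit), since the correction factor is > 1 here
```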

\(\mathrm{LogLik}_\mathrm{new}- \mathrm{Loglik} + \mathrm{Model~Size}\times\left(\tfrac{n}{n - \mathrm{Model~Size} -1}\right)\)

No variable selection

(small dots indicate mean values)

\(\mathrm{LogLik}_\mathrm{new}- \mathrm{Loglik} + \mathrm{Model~Size}\times\left(\tfrac{n}{n - \mathrm{Model~Size} -1}\right)\)

Variable selection

(small dots indicate mean values)

\(\mathrm{AIC}_C\) favors smaller models

Selecting variables

  • When considering which variables from a list to include, AIC / AIC\(_C\) are a means of comparing the candidate models

  • But one cannot compare all \(2^p\) possible models when \(p\) exceeds 20–30.
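To see the scale of the problem:

\[ 2^{20} \approx 1.0\times 10^{6}, \qquad 2^{30} \approx 1.1\times 10^{9}, \]

so exhaustive search over all subsets quickly becomes infeasible.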

This grief has a gravity, it pulls me down But a tiny voice whispers in my mind You are lost, hope is gone But you must go on… Just do the next right thing Take a step, step again It is all that I can to do The next right thing


Anna (Frozen 2)


Sequential selection procedures

  • Forward selection starts with the intercept-only model and sequentially adds the variable that “improves” the model by the greatest amount

  • Backward selection starts with the full model and sequentially removes the variable whose presence “worsens” the model by the greatest amount

The MASS::stepAIC() function does automatic forward/backward selection, using AIC as the measure of model fit
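For instance, on a built-in dataset (my example, not from the slides):

```r
library(MASS)  # provides stepAIC()

# Start from the full model and let stepAIC add/drop terms by AIC
fit_full <- lm(mpg ~ ., data = mtcars)
fit_step <- stepAIC(fit_full, direction = "both", trace = FALSE)
formula(fit_step)  # the selected model, a subset of the 10 predictors
```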

AIC is defined only up to an additive constant

R functions AIC() and extractAIC() (which stepAIC() uses) employ different constants. From extractAIC help page:

For linear models with unknown scale (i.e., for lm and aov), -2 log L is computed from the deviance and uses a different additive constant to logLik and hence AIC.

For those interested: for linear models the difference between AIC() and extractAIC() is \(n(\log(2\pi) + 1) + 2\), where the extra 2 arises because AIC() counts \(\sigma^2\) as an estimated parameter while extractAIC() does not
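A quick numerical check (mine, not on the slide); note that AIC() additionally counts \(\sigma^2\) as an estimated parameter, which contributes a further 2 to the difference:

```r
# For an lm fit, AIC() and extractAIC() differ by a constant that
# depends only on n (plus 2, since AIC() also counts sigma^2)
fit <- lm(mpg ~ wt, data = mtcars)
n <- nobs(fit)
AIC(fit) - extractAIC(fit)[2]  # equals n * (log(2 * pi) + 1) + 2
```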

Example: Breast Cancer Diagnosis data

Digitized image of fine needle aspirate (FNA) of breast mass from 569 patients

cells

Street, Wolberg, and Mangasarian (1993); Mangasarian, Street, and Wolberg (1995)

Outcome: Clinical diagnosis (malignant or benign)

Predictors

  1. radius (mean of distances from center to points on the perimeter)
  2. texture (standard deviation of gray-scale values)
  3. perimeter
  4. area
  5. smoothness (local variation in radius lengths)
  6. compactness (perimeter^2 / area - 1.0)
  7. concavity (severity of concave portions of the contour)
  8. concave points (number of concave portions of the contour)
  9. symmetry
  10. fractal dimension (“coastline approximation” - 1)

Data include mean, standard error, and worst measurements.

Available at https://archive.ics.uci.edu/ml/datasets/breast+cancer+wisconsin+(diagnostic)

breast_dx <-
  read_csv("bdiag.csv") %>%
  # Translate M/B into 1/0
  mutate(malignant = 1 * (diagnosis == "M")) %>% 
  # Drop errant space in 'concave points_mean' variable name 
  rename_with(~str_replace(string = ., pattern = " ", replacement = "")) %>%
  # Focus only on worst measurements
  select(malignant, 
         #contains("_mean"), 
         #contains("_se"), 
         contains("_worst")) 

Plan to fit a logistic regression model to predict the probability of being malignant vs. benign using the set of worst measurements

library(ggcorrplot)
ggcorrplot(cor(select(breast_dx, -malignant)), hc.order = TRUE)

Bootstrap demonstrates selection instability

  1. Sample an equal-sized dataset with replacement from the original dataset
  2. Run the forward selection process on the bootstrap dataset, separately for each criterion
  3. Record the selected model
  4. Repeat steps 1–3 \(B=2000\) times
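One bootstrap replicate might look like the sketch below (illustrative code, not the lecture's script; it uses AIC only and assumes a binary outcome column):

```r
library(MASS)  # provides stepAIC()

# One bootstrap replicate of forward selection; returns the names of
# the selected variables (sketch only)
boot_forward <- function(data, outcome) {
  boot_data <- data[sample(nrow(data), replace = TRUE), ]
  preds <- setdiff(names(data), outcome)
  null_fit <- glm(reformulate("1", outcome), data = boot_data,
                  family = binomial)
  fit <- stepAIC(null_fit,
                 scope = list(lower = ~ 1, upper = reformulate(preds)),
                 direction = "forward", trace = FALSE)
  sort(all.vars(formula(fit))[-1])
}

# e.g. tabulate selected sets over B replicates:
# sets <- replicate(2000, paste(boot_forward(breast_dx, "malignant"),
#                               collapse = ", "))
# sort(table(sets), decreasing = TRUE)
```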

Comparison of percent selected

All models with AIC or AICc within 1 of the smallest value

A total of 142 distinct variable combinations were selected across the B = 2000 bootstrap datasets

Variable set / Pct. sel. (AIC) / Delta(obs. AIC) / Pct. sel. (AICc) / Delta(obs. AICc)
area, compactness, concavepoints, concavity, smoothness, symmetry, texture 0.7% -1.28 0.7% -1.23
area, concavepoints, smoothness, texture 1.2% -1.10 1.2% -1.20
area, concavepoints, smoothness, symmetry, texture 0.9% -1.01 0.9% -1.06
area, compactness, concavepoints, smoothness, symmetry, texture 0.2% -0.85 0.2% -0.85
area, concavepoints, fractal_dimension, smoothness, symmetry, texture 0.2% -0.27 0.2% -0.27
area, compactness, concavepoints, concavity, smoothness, texture 0.2% -0.09 0.2% -0.09
area, concavepoints, perimeter, smoothness, symmetry, texture 12.2% 0.00 12.2% 0.00
area, concavepoints, radius, smoothness, texture 0.3% 0.01 0.3% -0.04
area, concavepoints, radius, smoothness, symmetry, texture 0.1% 0.03 0.1% 0.03
area, concavepoints, perimeter, smoothness, texture 20.9% 0.03 20.9% -0.02
area, compactness, concavepoints, radius, smoothness, symmetry, texture 0.2% 0.30 0.2% 0.36
area, concavepoints, concavity, smoothness, texture 0.1% 0.31 0.1% 0.26
area, compactness, concavepoints, smoothness, texture 0.1% 0.32 0.1% 0.27
area, concavepoints, concavity, fractal_dimension, smoothness, symmetry, texture 0.1% 0.42 0.1% 0.47
area, concavepoints, fractal_dimension, smoothness, texture 0.1% 0.58 0.1% 0.53
area, compactness, concavepoints, concavity, radius, smoothness, symmetry, texture 0.1% 0.62 0.1% 0.74
area, compactness, concavepoints, concavity, fractal_dimension, smoothness, symmetry, texture 0.1% 0.69 0.1% 0.82
area, concavepoints, fractal_dimension, radius, smoothness, symmetry, texture 0.1% 0.70 0.1% 0.76
area, compactness, concavepoints, concavity, perimeter, smoothness, symmetry, texture 4.2% 0.72 4.0% 0.84
area, concavepoints, concavity, perimeter, smoothness, texture 3.0% 0.92 3.0% 0.92
area, compactness, concavepoints, perimeter, smoothness, symmetry, texture 1.2% 1.06 1.3% 1.12
area, concavepoints, concavity, fractal_dimension, smoothness, texture 0.1% 1.12 0.1% 1.12
area, compactness, concavepoints, fractal_dimension, smoothness, symmetry, texture 0.1% 1.13 0.1% 1.19
area, concavepoints, fractal_dimension, perimeter, smoothness, symmetry, texture 2.5% 1.19 2.6% 1.25
area, concavepoints, concavity, perimeter, smoothness, symmetry, texture 0.7% 1.36 0.7% 1.42
area, concavepoints, perimeter, radius, smoothness, symmetry, texture 1.4% 1.49 1.4% 1.55
area, compactness, concavepoints, radius, smoothness, texture 0.1% 1.52 0.1% 1.52
area, concavepoints, perimeter, radius, smoothness, texture 2.5% 1.55 2.5% 1.55
area, concavepoints, concavity, fractal_dimension, perimeter, smoothness, symmetry, texture 5.8% 1.68 5.8% 1.80
area, compactness, concavepoints, concavity, perimeter, smoothness, texture 1.2% 1.78 1.2% 1.84
area, concavepoints, concavity, fractal_dimension, radius, smoothness, symmetry, texture 0.1% 1.83 0.1% 1.95
area, concavepoints, fractal_dimension, perimeter, smoothness, texture 0.8% 1.94 0.8% 1.94
area, compactness, concavepoints, perimeter, smoothness, texture 0.3% 1.94 0.3% 1.94
Subtotal 62.0% 62.0%

Comparison of three models fit to observed breast DX data

Variable / Mechanism / logOR / SE / logOR/SE
area_worst Best AIC 0.01 0.00 6.06
area_worst Forward selected 0.02 0.01 3.06
area_worst Most common 0.02 0.01 3.06
compactness_worst Best AIC -9.31 4.62 -2.02
concavepoints_worst Best AIC 36.89 15.18 2.43
concavepoints_worst Forward selected 38.47 14.21 2.71
concavepoints_worst Most common 38.47 14.21 2.71
concavity_worst Best AIC 5.06 2.75 1.84
perimeter_worst Forward selected -0.10 0.10 -1.00
perimeter_worst Most common -0.10 0.10 -1.00
smoothness_worst Best AIC 52.74 20.17 2.61
smoothness_worst Forward selected 45.81 19.97 2.29
smoothness_worst Most common 45.81 19.97 2.29
symmetry_worst Best AIC 9.40 5.64 1.67
symmetry_worst Forward selected 6.94 5.14 1.35
symmetry_worst Most common 6.94 5.14 1.35
texture_worst Best AIC 0.28 0.06 4.78
texture_worst Forward selected 0.28 0.06 4.79
texture_worst Most common 0.28 0.06 4.79

Summary thoughts

AIC, AIC\(_C\) require two ingredients

\(\mathrm{AIC}\) and \(\mathrm{AIC}_C\) are useful model selection techniques, provided you are able to do two things:

  • calculate likelihood

  • count parameters (not always trivial)

When one or both of these cannot be computed, other options are needed, e.g. cross-validation / split-sampling

There are better options than stepwise selection

  • Least absolute shrinkage and selection operator (LASSO)

  • Elastic net

Be familiar with your data / interrogate your model

Model selection is not model assessment

The focus of this talk has been model selection (“estimating the performance of different models in order to choose the best one”), which is distinct from model assessment (“having chosen a final model, estimating its prediction error…on new data”; Hastie, Tibshirani, and Friedman (2009))

Standard errors from the selected model do not account for the selection process. Ideal assessment requires an additional (third) dataset: testing data (future lecture?)

References

Akaike, Hirotsugu. 1973. “Information Theory and an Extension of the Maximum Likelihood Principle.” Second International Symposium on Information Theory, 267–81.

Boonstra, Philip S, Bhramar Mukherjee, and Jeremy M G Taylor. 2015. “A Small-Sample Choice of the Tuning Parameter in Ridge Regression.” Statistica Sinica 25 (3): 1185–1206.

Gałecki, Andrzej, and Tomasz Burzykowski. 2013. Linear Mixed-Effects Models Using r: A Step-by-Step Approach. Springer Science & Business Media.

Hastie, T, R Tibshirani, and J Friedman. 2009. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. 2nd ed. New York, NY: Springer.

Hurvich, Clifford M., and Chih-Ling Tsai. 1989. “Regression and Time Series Model Selection in Small Samples.” Biometrika 76: 297–307.

Mangasarian, Olvi L, W Nick Street, and William H Wolberg. 1995. “Breast Cancer Diagnosis and Prognosis via Linear Programming.” Operations Research 43 (4): 570–77.

Street, W Nick, William H Wolberg, and Olvi L Mangasarian. 1993. “Nuclear Feature Extraction for Breast Tumor Diagnosis.” In Biomedical Image Processing and Biomedical Visualization, 1905:861–70. International Society for Optics; Photonics.

Tukey, John W. 1980. “We Need Both Exploratory and Confirmatory.” The American Statistician 34 (1): 23–25.